
Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072 (val_bpb 1.1163, 3-seed mean) #593

Closed
abaybektursun wants to merge 5 commits into openai:main from abaybektursun:submission/full-gptq-leakyrelu-1.1170

Conversation

Contributor

@abaybektursun abaybektursun commented Mar 24, 2026

Summary

  • val_bpb: 1.1163 (3-seed mean, std 0.0012) | ~15.90 MB | 8×H100 SXM, 600s | No TTT

3-Seed Results

| Seed | step_avg | steps | Pre-quant bpb | Post-GPTQ sliding bpb | Artifact (bytes) |
|------|----------|-------|---------------|-----------------------|------------------|
| 42   | 83.4ms   | 7,192 | 1.1349        | 1.1149                | 15,895,636       |
| 1337 | 83.4ms   | 7,195 | 1.1370        | 1.1172                | 15,899,284       |
| 2024 | 83.5ms   | 7,190 | 1.1367        | 1.1167                | 15,904,036       |
| Mean | 83.4ms   | 7,192 | 1.1362        | 1.1163 (std 0.0012)   |                  |

Key Techniques

Full Hessian GPTQ. Second-order, Hessian-aware quantization with Cholesky error compensation and column reordering (actorder). Relative to the pre-quantization numbers, the post-GPTQ sliding evaluation is 0.0199 bpb better (mean 1.1362 → 1.1163).
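As a sketch of the mechanism (illustrative only, not this PR's kernel; the function name and the toy per-column scale are assumptions): each column is rounded in actorder, and its rounding error is pushed onto the not-yet-quantized columns through the Cholesky factor of the inverse Hessian.

```python
import numpy as np

def gptq_quantize(W, H, bits=6):
    # Minimal Hessian-aware GPTQ sketch: quantize columns of W in order of
    # decreasing Hessian diagonal (actorder), compensating each column's
    # rounding error via the upper Cholesky factor of the inverse Hessian.
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    order = np.argsort(-np.diag(H))                      # actorder permutation
    W, Hp = W[:, order], H[np.ix_(order, order)]
    Hinv = np.linalg.cholesky(np.linalg.inv(Hp + 1e-6 * np.eye(n))).T
    qmax = 2 ** (bits - 1) - 1                           # int6 -> [-32, 31]
    Q = np.zeros_like(W)
    for i in range(n):
        col = W[:, i]
        scale = np.abs(col).max() / qmax + 1e-12         # toy per-column scale
        Q[:, i] = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        err = (col - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])   # error compensation
    return Q[:, np.argsort(order)]                       # undo actorder
```

The Hessian here would come from calibration activations (H ∝ XᵀX); the real implementation additionally packs the int6 grid with lzma.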

Parallel Muon optimizer. Parameter Banking (batched Newton-Schulz, 15× faster optimizer) + async reduce-scatter/all-gather communication overlap. 83ms/step vs ~89ms baseline.
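The batched Newton-Schulz step can be sketched as follows (a toy numpy version under assumptions: quintic coefficients from the public Muon implementation, and "banking" rendered as one batched matmul over a stack of matrices rather than a per-parameter loop):

```python
import numpy as np

def batched_newton_schulz(G, steps=5):
    # Orthogonalize a whole bank of gradient matrices at once: G has shape
    # (bank, m, n). Each iteration is a batched matmul, which is what makes
    # banking faster than looping over parameters one at a time.
    a, b, c = 3.4445, -4.7750, 2.0315          # Muon's quintic coefficients
    X = G / (np.linalg.norm(G, axis=(-2, -1), keepdims=True) + 1e-7)
    tall = X.shape[-2] > X.shape[-1]
    if tall:                                    # iterate on the short side
        X = np.swapaxes(X, -1, -2)
    for _ in range(steps):
        A = X @ np.swapaxes(X, -1, -2)
        X = a * X + (b * A + c * A @ A) @ X     # drives singular values to ~1
    return np.swapaxes(X, -1, -2) if tall else X
```

The async reduce-scatter/all-gather overlap mentioned above is a separate, distributed-communication concern not shown here.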

BigramHash 3072×80. Budget-optimal allocation of the 16MB artifact limit — more hash buckets with narrower embeddings. Coverage beats fidelity: each bigram embedding passes through a learned 80→512 projection that reconstructs useful features from narrower input, while doubling buckets from 1536→3072 halves hash collisions, capturing more unique bigram patterns. Narrower embeddings also compress better under GPTQ+lzma (random-looking vectors have high entropy), freeing bytes for the larger table.
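A minimal sketch of the bucket-lookup-then-project path (the hash mixing constant, initialization, and function names are illustrative assumptions; only the 3072×80 table and 80→512 projection shapes come from the PR):

```python
import numpy as np

N_BUCKETS, EMB_DIM, MODEL_DIM = 3072, 80, 512   # shapes from the PR's config

rng = np.random.default_rng(0)
table = rng.standard_normal((N_BUCKETS, EMB_DIM)).astype(np.float32) * 0.02
proj = rng.standard_normal((EMB_DIM, MODEL_DIM)).astype(np.float32) * 0.02

def bigram_features(tokens):
    # Hash each (prev, cur) token pair into one of N_BUCKETS slots, look up
    # its 80-d embedding, and project it into the 512-d residual stream.
    # Doubling N_BUCKETS halves the expected collision rate at fixed load.
    t = np.asarray(tokens, dtype=np.uint64)
    h = (t[:-1] * np.uint64(0x9E3779B1) ^ t[1:]) % np.uint64(N_BUCKETS)
    return table[h.astype(np.int64)] @ proj     # shape (len(tokens)-1, 512)
```

In training, `table` and `proj` would be learned; the narrow 80-d rows are what compress well under GPTQ+lzma.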

Architecture

| Component    | Setting                           |
|--------------|-----------------------------------|
| Layers       | 11 (512d, 8H, 4KV)                |
| MLP          | 3× with LeakyReLU(0.5)²           |
| BigramHash   | 3072 buckets, dim=80              |
| XSA          | Last 4 layers                     |
| RoPE         | Partial (16/64 dims)              |
| LN Scale     | 1/√(layer+1)                      |
| VE128        | Layers 9-10                       |
| Weight avg   | EMA(0.997) + Tight SWA (every 50) |
| Quantization | Full Hessian GPTQ int6 + lzma(9)  |
| Optimizer    | Parameter Banking + Parallel Muon |
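The LeakyReLU(0.5)² entry sits in the squared-ReLU family common in speedrun MLPs. One plausible reading, sketched under loud assumptions (the sign-preserving square is a guess; the PR never spells the form out):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # Assumed form of "LeakyReLU(0.5)^2": LeakyReLU with negative slope 0.5,
    # then a square that keeps the sign so the activation stays monotone.
    # Hypothetical reconstruction, not confirmed by the PR text.
    y = np.where(x >= 0.0, x, slope * x)
    return np.sign(y) * y * y
```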

Credits

🤖 Generated with Claude Code

…ed mean)

Full Hessian GPTQ (Cholesky error compensation, actorder) replaces
GPTQ-lite, improving post-quant BPB by 0.0048. LeakyReLU(0.5)² activation.
No TTT needed.

3-seed results:
  Seed 2025: 1.1167 bpb, 15.90 MB
  Seed 1337: 1.1171 bpb, 15.96 MB
  Seed 2024: 1.1173 bpb, 15.99 MB
  Mean:      1.1170 (std 0.0003)

All artifacts under 16MB. Eval ~185s (well within 10 min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1170 (3-seed mean, no TTT) Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1171 (3-seed mean, no TTT) Mar 24, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
Novel: XSA-all(11) + selective ±1 pruning on openai#593 base.
Parallel Muon gives 83ms/step = competition-grade throughput.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
XSA-all(11) on openai#593 Parallel Muon stack. 83ms/step, 6923 steps.
15.94MB fits. Novel contribution: XSA-all + selective pruning.
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 24, 2026
Beats merged openai#414 by 0.0073 nats. Meets record threshold.
Stack: openai#593 + XSA-all(11) + selective ±1 pruning.
Ready for submission.
BigramHash 1536×128 → 3072×80: coverage-over-fidelity budget reallocation.
More hash buckets capture more bigram patterns; narrower embeddings compress
better under GPTQ+lzma, freeing bytes for the larger table.

GPTQ memory fix: free training model before Hessian calibration to prevent
OOM with the larger BigramHash optimizer state.

3-seed results: 1.1149, 1.1172, 1.1167 (mean 1.1163, std 0.0012)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1171 (3-seed mean, no TTT) Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072 (val_bpb 1.1163, 3-seed mean) Mar 24, 2026
@valerio-oai
Contributor

As per your table, you are counting the GPTQ calibration as an eval-time intervention. However, your implementation reuses training data for it, meaning it accesses training data at eval time, which is forbidden. Closing for now: if you want this calibration to count as training, it should be counted as part of the training 600s-budget, not the eval budget.

vimeto added a commit to vimeto/parameter-golf that referenced this pull request Mar 24, 2026
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
…xH100

30+ experiments on the PR openai#593 stack (1.1171 BPB), all negative or marginal:
- CUTLASS SM90 GEMM: 2.5x slower than cuBLAS
- Fused Triton GEMM+activation: autograd.Function kills backward
- FP8, QKV fusion, custom CUDA: all slower or no improvement
- SpinQuant, mixed int5/int8, Soft-Round QAT: noise-level
- XSA-all, VRL, Gated Attention, bigger model, shard ordering: all worse
- 22 legal TTT experiments: all worse than non-TTT baseline

Key finding: 82ms step is 95%+ optimized. torch.compile handles all fusion.
Competition at d=512 is bits-per-parameter, not FLOPS-per-second.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>